Chinese Tweets Segmentation based on Morphemes
Abstract
Chinese tweet segmentation is a critical problem in natural language processing. While the segmentation of in-vocabulary words is well studied, few findings are available on the prediction of new words on Twitter. In this paper, we exploit multiple features for segmenting tweets in real text. To this end, we first take morphemes as the basic component units of Chinese words and investigate the relationship between Chinese new words and their internal morphological structures. We then explore both word-internal cues and word-external contextual features, and combine them to segment Chinese new words using conditional random fields. Our experimental results show that incorporating multiple features, especially word-internal morphological features, is of great value to Chinese tweet segmentation.
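The abstract does not specify the tag set or the feature templates fed to the conditional random field. As a rough illustration only, the sketch below assumes the common character-level B/M/E/S tagging scheme and a tiny, hypothetical morpheme lexicon (`MORPHEME_POSITIONS`); the function names and feature keys are likewise assumptions, not the authors' design. It shows how a segmented sentence could be turned into per-character labels, and how word-external context (neighbouring characters) and word-internal cues (morpheme position tendencies) might be combined into one feature dictionary per character.

```python
from typing import Dict, List

# Hypothetical toy lexicon: positions a morpheme tends to occupy inside a word.
# A real system would derive this from a morpheme-annotated dictionary.
MORPHEME_POSITIONS = {"网": {"prefix"}, "友": {"suffix"}}


def words_to_bmes(words: List[str]) -> List[str]:
    """Map each character of a segmented sentence to a B/M/E/S tag."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, End
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted B/M/E/S tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def char_features(chars: str, i: int) -> Dict[str, str]:
    """Combine contextual and morpheme-based features for character i."""
    feats = {
        "char": chars[i],
        "prev": chars[i - 1] if i > 0 else "<BOS>",
        "next": chars[i + 1] if i < len(chars) - 1 else "<EOS>",
    }
    # Word-internal cue: where this morpheme usually sits inside a word.
    for pos in sorted(MORPHEME_POSITIONS.get(chars[i], ())):
        feats["morph_" + pos] = "1"
    return feats
```

Feature dictionaries of this shape, one list per sentence paired with the matching tag list, are the kind of input that typical linear-chain CRF toolkits accept for training; decoding then yields a tag sequence that `bmes_to_words` converts back into words.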
Similar resources
A Unified Framework for Text Analysis in Chinese TTS
This paper presents a robust text analysis system for Chinese text-to-speech synthesis. In this study, a lexicon word or a continuum of non-hanzi characters of the same category (e.g. a digit string) is defined as a morpheme, which is the basic unit forming a Chinese word. Based on this definition, the three key issues concerning the interpretation of real Chinese text, namely lexical disambi...
Design of Chinese Morphological Analyzer
This is a pilot study that aims to design a Chinese morphological analyzer able to predict the syntactic and semantic properties of nominal, verbal and adjectival compounds. The morphological structures of compound words contain the information essential for determining their syntactic and semantic characteristics. In particular, morphological analysis is a primary step for predicti...
Hybrid Models for Chinese Unknown Word Resolution Dissertation
Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistic...
Chinese Word Segmentation as LMR Tagging
In this paper we present Chinese word segmentation algorithms based on the so-called LMR tagging. Our LMR taggers are implemented with the Maximum Entropy Markov Model, and we then use Transformation-Based Learning to combine the results of the two LMR taggers that scan the input in opposite directions. Our system achieves F-scores of and on the Academia Sinica corpus and the Hong Kong City Unive...
Encoding motion events in Chinese and the “scalar specificity constraint”
Mandarin Chinese often expresses motion events with more than one verbal motion morpheme, e.g., 退 tui ‘recede’ and 回 hui ‘return’ in 退回房間裏 tui-hui fangjian-li recede-return room-inside ‘return into the room’. Building on recent work on “scale structure”, this paper proposes a “Motion Morpheme Hierarchy” that can be used to better predict the order of co-occurring motion morphemes: specifically,...
Publication date: 2012